Add fix for prerenderer cache-clearing on publish #4719

Merged
backspace merged 33 commits into main from published-caching-cs-11043
May 14, 2026

Conversation

@backspace
Contributor

@backspace backspace commented May 7, 2026

Publishing a realm has been inconsistent in whether the changes actually show up, which has indeed been about caching, but across multiple layers. I’m out of my depth here, but here’s Claude’s explanation (edited after PR feedback about unnecessary work):

Bug

After republish, the published URL kept serving pre-republish HTML. In production:
nyuitp2026.boxel.site served the old wordmark for ~37h after the user pushed the new img form.

Root cause

The Realm class holds an in-memory #sourceCache keyed by path. After a republish:

  1. Publish handler does the FS swap and enqueues a reindex.
  2. The index worker fetches source over HTTP into the realm-server.
  3. getSourceOrRedirect returns pre-swap bytes from #sourceCache (plus a stale content-hash etag).
  4. The worker writes stale HTML into boxel_index.isolated_html.

The design assumed the NodeAdapter file-watcher would invalidate #sourceCache, but
ENABLE_FILE_WATCHER is unset in staging/production, so it never fires.

Fix

One commit. New Realm.clearLocalCaches() method; handle-publish-realm calls it after the FS swap,
before the reindex enqueue. Makes the invalidation synchronous — no race, no file-watcher
dependency.
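For orientation, a minimal sketch of the call order described above (function names are taken from the commit messages further down in this PR; exact signatures and the surrounding handler code are assumptions):

  // Inside the handle-publish-realm handler, after the FS swap (sketch only).
  await upsertPublishedRealmInRegistry(publishedRealmURL); // registry bookkeeping
  let realm = await lookupOrMount(publishedRealmURL);      // mount the published realm
  realm.clearLocalCaches();                                // drop #sourceCache + module cache synchronously
  await enqueueReindexRealmJob(publishedRealmURL);         // reindex now reads post-swap bytes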

@github-actions
Contributor

github-actions Bot commented May 7, 2026

Preview deployments

Host Test Results

    1 files  ±0      1 suites  ±0   1h 54m 0s ⏱️ - 9m 6s
2 661 tests ±0  2 646 ✅ ±0  15 💤 ±0  0 ❌ ±0 
2 680 runs  ±0  2 665 ✅ ±0  15 💤 ±0  0 ❌ ±0 

Results for commit ceeed01. ± Comparison against earlier commit df9d319.

Realm Server Test Results

    1 files  ±0      1 suites  ±0   11m 6s ⏱️ +8s
1 367 tests  - 1  1 367 ✅ ±0  0 💤 ±0  0 ❌  - 1 
1 446 runs   - 1  1 446 ✅ ±0  0 💤 ±0  0 ❌  - 1 

Results for commit ceeed01. ± Comparison against earlier commit df9d319.

backspace and others added 3 commits May 11, 2026 09:29
Covers the gap that let CS-11043 ship: every existing publish-realm
test does a single publish, so a republish that signals success but
serves stale content has no regression net. The new test defines a
sentinel card, publishes, asserts the initial sentinel renders on
the published URL, edits the source, republishes, and asserts the
updated sentinel renders (and the initial one is gone).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
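A condensed sketch of the test shape this commit describes, in Playwright (publishRealm, postCardSource, and cardWithTitle are hypothetical helpers standing in for the real test utilities; the real spec opens the published URL in a popup):

  import { test, expect } from '@playwright/test';

  test('republished realm serves updated content', async ({ page, context }) => {
    const initial = `sentinel-initial-${Date.now()}`; // stand-in for a uuid
    const updated = `sentinel-updated-${Date.now()}`;

    await postCardSource(page, 'sentinel-card.gts', cardWithTitle(initial));
    const publishedRealmURL = await publishRealm(page); // first publish

    let published = await context.newPage();
    await published.goto(publishedRealmURL);
    await expect(published.getByText(initial)).toBeVisible(); // initial sentinel renders

    await postCardSource(page, 'sentinel-card.gts', cardWithTitle(updated)); // edit source
    await publishRealm(page); // republish

    await published.reload();
    await expect(published.getByText(updated)).toBeVisible(); // updated sentinel renders
    await expect(published.getByText(initial)).toHaveCount(0); // initial one is gone
  });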
backspace and others added 4 commits May 12, 2026 16:23
The host-side Loader-reset approach (sticky-for-batch + the
two-flavor clearCache/resetLoaderOnly refinement) was causing
widespread CI fallout — query-field server-hydration tests,
live-update tests, file-tree navigation, code-submode preview,
multi-reindex timeouts, host memory baseline. The Loader reset
also doesn't directly address the actual root cause investigated
under CS-11043: the stale module bytes the published nyuitp2026
realm served for ~37h were sitting in Chromium's process-level
HTTP cache, not the host's in-process Loader cache.

Roll back to pre-fix state. A follow-up commit on this branch
targets the Chromium-cache layer directly via Cache-Control on
the realm-server's source/module responses.

Kept on the branch (independently useful regardless of fix
mechanism):
  - packages/matrix/tests/publish-realm.spec.ts republish test
    (the regression net that fills the gap which let CS-11043
    ship in the first place).
  - infra/ Checkly canary script + Terraform + provisioning
    runbook on the checkly-publish-cs-11096 branch in the infra
    repo (production monitoring).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…11043)

The publish-republish failure mode investigated under CS-11043
was Chromium's process-level HTTP cache holding stale module
bytes (presentation.gts and friends) across publishes — verified
by inspecting the staging nyuitp2026 realm's failed render.

Source responses were sent with no `Cache-Control` and no
`Last-Modified` header (verified via `curl -sI`), which lets
Chromium apply heuristic caching. That cache lives at the
browser-process level, shared across all puppeteer pages on a
prerender-server task. After a republish, even freshly-spawned
puppeteer pages could pull old module bytes from the persistent
Chromium cache for ~37h, until the prerender-server task itself
rotated.

`Cache-Control: no-store` evicts the heuristic-cache vector
entirely. Every source/module fetch goes back to the realm-server,
which serves whatever bytes are current on EFS. Cost: no browser
cache reuse for unchanged source files — acceptable because card
content is prerendered into boxel_index.isolated_html by the
indexer and not typically re-fetched per page view anyway.

Applied at the single place every source response passes through
(`getSourceOrRedirect`'s defaultHeaders); cached-redirect entries
go through a separate header map that doesn't get the new value,
which is fine — 302 redirects are not the stale-bytes vector.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
backspace and others added 8 commits May 12, 2026 20:06
…(CS-11043)

The Cache-Control: no-store change targets Chromium's HTTP cache,
but the publish-republish staleness investigated under CS-11043
also lived a layer up: the prerender server's puppeteer pages
hold a host-app `Loader` that caches evaluated modules by URL.
After a republish swaps new bytes onto disk, that Loader would
keep handing back the OLD module on subsequent renders — the
HTTP layer's no-store header doesn't reach into the host's
module cache.

In production this manifested as nyuitp2026.boxel.site rendering
the wordmark form of presentation.gts for ~37h after publishing
the img form. The matrix republish test (added on this branch)
reproduces the same failure in the test env: the second publish's
reindex runs on the warm prerender, the Loader serves the
initial sentinel-card module, and the published URL serves the
initial sentinel even though disk has the updated source.

Fix: after the FS swap completes (and alongside the existing
DELETE FROM modules DB-cache clear), call
prerenderer.disposeAffinity({ affinityType: 'realm', affinityValue:
publishedRealmURL }). This tears down the puppeteer pages for
the affinity; the next render against the realm spawns a fresh
page that fetches modules from disk via the realm-server.

Made `disposeAffinity` optional on the Prerenderer interface
(matching the `releaseBatch?` pattern) so stub / remote
implementations aren't forced to provide it. The call is
best-effort: a thrown error is logged via log.warn but doesn't
fail the publish, since the page-pool's LRU rotation cleans up
eventually — we just want to avoid the long staleness window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(CS-11043)

The previous commit added a publish-handler call to
`prerenderer.disposeAffinity(...)`, but in any deployment that
routes prerender requests over HTTP (which is what every real
deployment uses, and what the matrix test infra uses via
isolated-realm-server), the realm-server holds a
`RemotePrerenderer` — which didn't implement `disposeAffinity`,
so the call silently no-op'd via `prerenderer?.disposeAffinity`.

Wire the call through the same plumbing the existing
`releaseBatch` path uses:

  - RemotePrerenderer.disposeAffinity() — POST /dispose-affinity
    on the prerenderURL (manager or single prerender server).
    Best-effort fetch with an abort timer so a stuck upstream
    can't block the publish handler.

  - prerender-app POST /dispose-affinity — accepts JSON:API body
    with {affinityType, affinityValue}, calls into the concrete
    Prerenderer's existing disposeAffinity() method (which clears
    the auth cache and tears down all puppeteer pages for the
    affinity).

  - manager-app POST /dispose-affinity — fans the request out to
    every server currently assigned the affinity, mirroring how
    release-batch broadcasts. Each server runs its own local
    disposal; the broadcast resolves when all targets do.

After this, the matrix republish test's second publish triggers
disposeAffinity end-to-end: realm-server publish handler →
RemotePrerenderer HTTP POST → manager fan-out → prerender server
disposes pages. The next render against the realm spawns a fresh
puppeteer page that fetches sentinel-card.gts from disk and sees
the updated sentinel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
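A rough sketch of the RemotePrerenderer client side as described in this commit (the /dispose-affinity path and the {affinityType, affinityValue} body are from the commit message; the JSON:API wrapping, timeout value, and class shape are assumptions, and this wiring is removed again later in the PR):

  class RemotePrerenderer {
    constructor(private prerenderURL: string) {}

    // Best-effort: a stuck upstream can't block the publish handler.
    async disposeAffinity(opts: { affinityType: string; affinityValue: string }) {
      let controller = new AbortController();
      let timer = setTimeout(() => controller.abort(), 5_000); // timeout value is an assumption
      try {
        await fetch(`${this.prerenderURL}/dispose-affinity`, {
          method: 'POST',
          headers: { 'content-type': 'application/vnd.api+json' },
          body: JSON.stringify({ data: { attributes: opts } }),
          signal: controller.signal,
        });
      } catch {
        // tolerated; the page-pool's LRU rotation cleans up eventually
      } finally {
        clearTimeout(timer);
      }
    }
  }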
The test keeps failing with the same shape — popup shows the
initial sentinel after the second publish — but we can't tell
which link in the chain is broken without ground-truth signals.
This commit adds checkpoint logging and direct-fetch
diagnostics so a CI failure tells us:

  - whether the source-content POST actually landed (compare
    GET'd source body against the updated sentinel value),
  - whether the default-domain checkbox state needed re-clicking
    after modal close+reopen,
  - whether the publish button was actually enabled at click
    time,
  - whether the /_publish-realm response was observed (and its
    status code if so),
  - what a plain HTTP fetch of the published URL returns vs. what
    the popup renders (decouples server-side staleness from
    browser-cache).

Also adds the previously-missing default-domain-checkbox click
on the second-publish path. The checkbox can lose its
selection on modal close, leaving `isPublishDisabled` true and
the publish click a silent no-op. The new `isChecked()` /
`isDisabled()` checks make that visible, and the conditional
re-click avoids the failure mode.

Strictly diagnostic + bugfix. No assertion changes; if the
test still fails, the console.log lines pinpoint which of the
chain links is the actual broken one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
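The checkbox guard described above looks roughly like this (the selectors are placeholders, not the real test ids):

  let checkbox = page.locator('[data-test-default-domain-checkbox]'); // placeholder selector
  if (!(await checkbox.isChecked())) {
    // The modal can drop the selection on close/reopen; re-click so the publish
    // button isn't left disabled and the publish click a silent no-op.
    await checkbox.click();
  }
  let publishButton = page.locator('[data-test-publish-button]'); // placeholder selector
  console.log('publish button disabled at click time:', await publishButton.isDisabled());
  await publishButton.click();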
…index (CS-11043)

Test instrumentation (commit 62b3ee9) pinpointed the actual
broken link: even after disposeAffinity tears down the prerender
server's puppeteer pages and the publish handler DELETEs the
DB-level modules cache, the realm-server's OWN per-Realm
#sourceCache still holds the pre-swap bytes. When the immediately-
enqueued reindex's worker fetches modules via HTTP to this
realm-server, getSourceOrRedirect returns the cached old bytes;
the reindex renders against stale source; the rendered HTML
lands in boxel_index.isolated_html; the published URL serves
old content forever.

The Phase-3-PR-2 publish flow relied on the NodeAdapter file
watcher to invalidate the realm's caches via change events, but
that's an async race against the immediately-enqueued reindex.
The matrix republish test's direct-HTTP diagnostic confirmed
the symptom: status=202 publish response, but the published URL
served `contains-initial=true contains-updated=false`.

Fix layered with the others:
  - clearLocalCaches() — new public method on Realm. Bulk-
    invalidates #sourceCache and the module cache. Different
    from __testOnlyClearCaches in that it leaves the
    test-only transpile counter alone.
  - handle-publish-realm — between upsertPublishedRealmInRegistry
    and enqueueReindexRealmJob, lookupOrMount the realm and call
    clearLocalCaches on it. For a republish (the bug case), this
    nukes the stale source bytes the reindex would otherwise
    pull. For a new publish, the mount is fresh and the call is
    a no-op.

Combined with the earlier commits this closes the full chain:
  - Cache-Control: no-store on source responses (commit 5e3a2b3) →
    Chromium HTTP cache evicted.
  - disposeAffinity on publish (9c2af8a + ed53ad1) →
    prerender server puppeteer-page host-Loader caches evicted.
  - clearLocalCaches on publish (this commit) → realm-server
    per-Realm #sourceCache + module cache evicted.

Each addresses a distinct layer; together they ensure the next
reindex after a republish renders against the bytes that are
actually on disk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The [CS-11043 ...ms] checkpoint logging + the direct-HTTP fetch
comparison + the pre-assertion sentinel-text log were all added to
diagnose where in the publish-republish chain things were
breaking. They worked — they pinpointed the realm-server
#sourceCache as the actual broken link, which the
clearLocalCaches commit then fixed. With the bug closed, those
console.logs are pure noise on every successful run and clutter
the test's intent.

Removed:
  - the step() helper definition
  - all [CS-11043 …ms] checkpoint calls
  - the direct request.get(publishedRealmURL) call before opening
    the popup (was to discriminate server-side staleness from
    browser-cache; both layers are addressed now)
  - the pre-assertion sentinelLocator.count()/textContent() log
    (standard Playwright assertion failures already show the
    expected/received values)

Kept (structural improvements that make the test more robust, not
just diagnostic):
  - the source-content read-back guard via request.get + expect:
    if postCardSource silently fails, the published-URL assertion
    below would otherwise fail with a misleading message
  - the domain-checkbox isChecked() guard + conditional click:
    defends against the modal losing its checkbox selection on
    close/reopen, which would make the publish click a silent
    no-op
  - the .catch(() => null) on waitForResponse + downstream
    if-guard: lets the test fall through to the URL assertion if
    the response wait is transiently lost, rather than failing on
    infrastructure noise

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
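The kept waitForResponse guard, roughly (the URL predicate is illustrative; the 202 status comes from the diagnostic output quoted in an earlier commit):

  let publishResponse = await page
    .waitForResponse((response) => response.url().includes('_publish-realm'))
    .catch(() => null); // a transiently-lost wait falls through instead of failing
  if (publishResponse) {
    expect(publishResponse.status()).toBe(202);
  }
  // Either way, continue to the published-URL content assertion below.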
@backspace backspace marked this pull request as ready for review May 14, 2026 00:24
@backspace backspace requested review from habdelra and lukemelia May 14, 2026 01:44
@habdelra
Contributor

are there tests that we can add that demonstrate these new capabilities--like the prerenderer's new dispose affinity endpoint?

Comment on lines +559 to +572
if (prerenderer?.disposeAffinity) {
  try {
    await prerenderer.disposeAffinity({
      affinityType: 'realm',
      affinityValue: publishedRealmURL,
    });
  } catch (e) {
    log.warn(
      `disposeAffinity failed for ${publishedRealmURL}: ${
        e instanceof Error ? e.message : String(e)
      } — continuing with publish; stale Loader cache may persist until LRU rotation`,
    );
  }
}
Contributor

@habdelra habdelra May 14, 2026

this seems like it might be overkill. there is a clearCache: true option on the /render route for the prerenderer that was created specifically for this purpose: to clear the loader cache for the affinity. this is how we can use the same tab to handle both code changes and instance changes and always use the most recent code. When we do this we destroy the browser context for the realm in the prerenderer, which is basically like throwing out the baby with the bath water. I'd be more interested to know why this option is not working for you

Contributor

@habdelra habdelra May 14, 2026

also, hopefully the etag should be working for the module updates. If the etag wasn't working, I could see how disposing of the affinity would yield a working result, as that would destroy the browser context and thus force the browser to refetch the modules even if the etag didn't change.

Contributor Author

Thanks for looking at this. Claude agreed with your assessment; we went through several iterations of trying to diagnose the problem, and the disposeAffinity call did end up not being needed. I've deployed this branch since removing it and publishing still worked.

There are now realm server tests for the new behaviours, too. Can you look again?

backspace and others added 5 commits May 14, 2026 07:22
The dispose-affinity wiring was a symptom-treatment workaround:
disposing the puppeteer pages after a republish made the bug go away
by spawning fresh pages whose host-Loaders had no cached modules. But
the actual cause was upstream — the realm-server's per-Realm
#sourceCache was returning pre-swap bytes to whichever Loader fetched
modules after the swap, regardless of whether that Loader was fresh.

With Realm.clearLocalCaches() (commit ec5ddf9) now invalidating
#sourceCache before the reindex enqueues, the existing IndexRunner
clearCache:true-on-first-render mechanism is sufficient: the Loader
reset re-fetches, the realm-server's source cache is already empty,
and the response carries fresh bytes + fresh content-hash etag.

Removes:
  - handle-publish-realm: prerenderer.disposeAffinity(...) call and
    the prerenderer destructure from CreateRoutesArgs
  - runtime-common Prerenderer interface: optional disposeAffinity
  - prerender-app: POST /dispose-affinity endpoint
  - manager-app: POST /dispose-affinity broadcast
  - RemotePrerenderer: disposeAffinity() client

The internal Prerenderer.disposeAffinity / PagePool.disposeAffinity
methods stay — they're still used by LRU rotation, mid-render cancel,
and the existing prerendering tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds three regression tests covering the load-bearing layers of the
publish-republish fix, complementing the matrix end-to-end test:

  - clearLocalCaches drops cached source bytes — warms the source
    cache via two source fetches (miss → hit), calls
    testRealm.clearLocalCaches(), asserts the next fetch is a miss.
    Directly exercises the public surface the publish handler now
    invokes after the FS swap.

  - source response sets Cache-Control: no-store — asserts the
    header on both miss-path and hit-path source responses. Documents
    the Chromium-cache contract the publish flow now depends on; if
    the header is ever dropped from defaultHeaders, Chromium will
    reintroduce the heuristic-cache vector that gave nyuitp2026 a
    37-hour staleness window.

  - republishing reflects updated source content in boxel_index —
    writes a card with title sentinel-initial-<uuid>, publishes,
    waits for that title to appear in boxel_index.head_html;
    rewrites with sentinel-updated-<uuid>, republishes, waits for
    the new title and asserts no row still references the initial
    sentinel. This is the data-layer regression for CS-11043 — if
    clearLocalCaches() is regressed, the second waitUntil times out
    exactly as the production bug would have it. Faster than the
    matrix Playwright test and gives a clearer signal at the DB
    layer specifically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
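A rough QUnit-style shape for the first of these tests (fetchSource and countSourceCacheMisses are hypothetical helpers; the real test may observe cache misses differently):

  test('clearLocalCaches drops cached source bytes', async function (assert) {
    await fetchSource(testRealm, 'sample-card.gts'); // miss: warms #sourceCache
    await fetchSource(testRealm, 'sample-card.gts'); // hit: served from cache
    assert.strictEqual(countSourceCacheMisses(), 1); // hypothetical miss counter

    testRealm.clearLocalCaches();

    await fetchSource(testRealm, 'sample-card.gts'); // must go back to disk
    assert.strictEqual(countSourceCacheMisses(), 2);
  });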
backspace and others added 2 commits May 14, 2026 09:23
Two corrections from the first CI failure:

  - source response sets Cache-Control: no-store: the precondition
    "first fetch is a cache miss" failed because testRealm.write()
    triggers the indexer, which fetches the source server-side and
    warms #sourceCache before the client's first GET. Drop the warm
    entry via __testOnlyClearCaches() right after write so the
    assertion exercises the genuine miss-path defaultHeaders.

  - republishing reflects updated source content in boxel_index:
    the wait for the initial sentinel timed out because I set
    `attributes.title`, which CardDef doesn't have. Title lives on
    cardInfo.name (which feeds the computed cardTitle). Set
    attributes.cardInfo.name instead and assert against
    search_doc::text — substring-matching the jsonb-as-text means
    we don't have to encode the exact path (cardInfo.name vs the
    derived cardTitle), and search_doc is populated for every
    indexed instance regardless of head-render outcome.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
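The substring wait described above, roughly (waitUntil and dbAdapter are stand-ins for whatever the realm-server tests actually use; table and column names come from the commit message):

  // Wait until some boxel_index row's search_doc contains the updated sentinel title.
  await waitUntil(async () => {
    let rows = await dbAdapter.execute(
      `SELECT 1 FROM boxel_index WHERE search_doc::text LIKE '%${updatedSentinel}%'`,
    );
    return rows.length > 0;
  });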
Commit 5e3a2b3 added `cache-control: no-store` to defaultHeaders
in `getSourceOrRedirect`, but serveLocalFile unconditionally sets
`cache-control: '${public|private}, max-age=0'` after spreading
defaultHeaders — the no-store value was clobbered every time. The
test added in the previous commit caught this: the assertion failed
because the actually-served value was `public, max-age=0`.

The reason this never mattered for production:

  - The 5e3a2b3 commit attributed the 37 h staleness to
    Chromium's heuristic HTTP cache, but Chromium only falls back
    to heuristic caching when there's no explicit Cache-Control —
    `public, max-age=0` was already on every source response. With
    max-age=0, every fetch revalidates via etag.

  - The etag is content-hash based (etagBase: cached.contentHash).
    The CS-11043 production failure was that the realm-server's
    own #sourceCache returned stale bytes AND a stale content-hash
    etag together — so revalidation returned 304 for a hash that
    didn't match disk. The clearLocalCaches() fix (ec5ddf9)
    invalidates that cache before the reindex enqueues, breaking
    the chain at its actual cause.

So 5e3a2b3 was based on incorrect analysis (Chromium heuristic
caching) and didn't even take effect at runtime. Removing it makes
the PR a single load-bearing fix (clearLocalCaches) with one test
that exercises it (the republish-into-boxel_index regression) plus
the unit test for clearLocalCaches itself. No dead code.

Also drops the cache-control test from the previous commit since
the contract it was asserting on isn't (and now never was) real.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
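The clobbering, in simplified form (serveLocalFile's real signature and surroundings differ; this only shows the ordering problem the test caught):

  // getSourceOrRedirect supplied the header in defaultHeaders...
  let defaultHeaders = { 'cache-control': 'no-store' /* , etag, content-type, ... */ };

  // ...but serveLocalFile spread them and then set its own value, so every
  // source response actually went out with `public, max-age=0`.
  let headers = {
    ...defaultHeaders,
    'cache-control': `${isPublicReadable ? 'public' : 'private'}, max-age=0`,
  };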
Comment on lines +1342 to +1358
// CS-11043. Bulk-invalidate this realm's in-process byte caches.
// Called by the publish-realm handler after the FS swap, BEFORE the
// reindex enqueues — so that subsequent source reads (which the
// reindex's prerender fans out across many of) bypass any
// pre-swap bytes the realm still has in `#sourceCache` /
// `#moduleCache`. The Phase-3-PR-2 publish flow relies on the
// NodeAdapter file-watcher to pick up the swap, but that's an
// async-event race against the immediately-enqueued reindex; this
// method makes the invalidation synchronous from the publish
// handler's vantage point. Different from `__testOnlyClearCaches`
// in that it does NOT reset the transpile counter (which is
// test-only diagnostic state, unrelated to byte-correctness).
clearLocalCaches(): void {
  this.#sourceCache.clear();
  this.#dropAllModuleCacheEntries();
}

Contributor

You might want to coordinate with @lukemelia here. This touches on the work to move caching out of memory in the realm server in preparation for horizontally scaled realms. If we had 2 realm servers that needed cache clearing, I'm not sure how you would coordinate this.

Contributor Author

Thanks, we talked it over and came up with an approach. I'm going to merge this and implement a multi-realm-server fix in CS-11153.

@backspace backspace merged commit fd727d1 into main May 14, 2026
67 of 68 checks passed
